UK 2012 Accidents Dataset - Association Rules

MSDS 7331 Data Mining - Section 403 - Lab 3

Group 3:

  • Bhuvana Adur Kannan (Team Lead)
  • Selwyn Samuel
  • Anjli Solsi

Business Understanding

Points: 10

  • Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?).
  • How will you measure the effectiveness of a good algorithm?
  • Why does your chosen validation method make sense for this specific dataset and the stakeholders needs?

Purpose of Dataset & Effectiveness of good algorithms

The UK government has gathered traffic accident data from the years 2000 to 2016, capturing data for over 1.6 million accidents. The data is collected from police reports and does not include fender bender type incidents. In order to reduce the complexity and resource requirements, we will be working on the accidents occured during 2012. The dataset contains a variety of variables related to these accidents.

This dataset is originally from the UK government website but is hosted by BigML Inc. It can be downloaded from the link: https://bigml.com/user/czuriaga/gallery/dataset/525dcd15035d076e7f00e3ac.

From a business perspective, we'd like to establish the association rules that can be derived from the dataset. These association rules can also be referred to as patterns that become apparent through data analysis. Specifically, we are looking at the factors/variables that are related to the severity of an accident. In our analysis, the three different severity levels are slight, serious, and fatal.

For our analysis, we performed association rule learning using the apriori algorithm. To measure the effectiveness of our association rules analysis, we will be looking at the correlation between minimum support, the number of rules derived or frequent itemsets, as well as the data processing time.

Considering the information presented in this dataset, the stakeholders could vary from everyday drivers to first responders to urban planners. This method of analysis and results will provide a better understanding of the factors surrounding different types of accidents, and allow them to re-route away from accident prone areas, plan and prepare for future incidents, or consider the critical factors when creating new roads or structures.

⏫ Back to Top

Project Goals

We wanted to use the association rules algorithms in R to derive meaningful correlations between factors contributing to the accidents in the UK. Since the apriori algorithm is more stable in R, we decided to do the entire lab using R and Jupyter notebook in the Anaconda environment.

  • Associated Rules that can explain the combination of factors in the UK Accidents
  • Determine the factors/rules for the different accident severity:
    • Fatal
    • Serious
    • Slight

⏫ Back to Top

Data Understanding 1

Points: 10

Visualize the any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.

Read the data files

We used the same data file for the previous labs. The accident data is in CSV format (UKaccidents.csv) that has close to 300k observations and 55 attributes. We created another CSV file (UKAccidents_uniquecolumnsR.csv) to rename and skip some of the columns that will not be considered for the association algorithms. This CSV contained two columns, one for the old attribute name and the other for the new attribute name. The attributes that we decided to skip will have "Skip" as the new attribute name. We will then programmatically remove those columns from the dataframe.

In [1]:
# Accident data file
accidents.file = '../../data/UKAccidents.csv'  

accident.data <- read.csv(accidents.file)  

# Excel file with the current and new attribute names. The attributes that needed to be removed are named as "Skip" and 
# they will be removed programmatically.
R.column.names.map.file  = '../../data/UKAccidents_uniquecolumnsR.csv'
R.column.names.map <- read.csv(R.column.names.map.file)
In [2]:
head(accident.data)
Location.Easting.OSGRLocation.Northing.OSGRLongitudeLatitudePolice.ForceAccident.SeverityNumber.of.VehiclesNumber.of.CasualtiesDateLocal.Authority..District.Age.of.VehicleDriver.IMD.DecileDriver.Home.Area.TypeDate.yearDate.monthDate.day.of.monthDate.day.of.weekDate.hourDate.minuteDate.second
527200 178760 -0.169101 51.49343 Metropolitan Police Slight 2 1 2012-01-19T20:35:00 Kensington and Chelsea 5 Less deprived 30-40% Urban area 2012 1 19 4 20 35 0
527200 178760 -0.169101 51.49343 Metropolitan Police Slight 2 1 2012-01-19T20:35:00 Kensington and Chelsea NA More deprived 20-30% Urban area 2012 1 19 4 20 35 0
524930 181430 -0.200838 51.51793 Metropolitan Police Slight 2 1 2012-01-04T17:00:00 Westminster NA Data missing or out of rangeData missing or out of range2012 1 4 3 17 0 0
524930 181430 -0.200838 51.51793 Metropolitan Police Slight 2 1 2012-01-04T17:00:00 Westminster NA More deprived 40-50% Urban area 2012 1 4 3 17 0 0
525860 178080 -0.188636 51.48762 Metropolitan Police Slight 2 1 2012-01-10T10:07:00 Kensington and Chelsea 8 More deprived 20-30% Urban area 2012 1 10 2 10 7 0
525860 178080 -0.188636 51.48762 Metropolitan Police Slight 2 1 2012-01-10T10:07:00 Kensington and Chelsea 2 More deprived 10-20% Urban area 2012 1 10 2 10 7 0
In [3]:
dim(accident.data)
  1. 265877
  2. 55
In [4]:
head(R.column.names.map)
OriginalNew
Location.Easting.OSGR Skip
Location.Northing.OSGRSkip
Longitude Skip
Latitude Skip
Police.Force PoliceForce
Accident.Severity Severity

As an example, we can see that location attributes will be removed from the dataframes as they may not help us in the association rules unless converted to categorical regions.

In [5]:
dim(R.column.names.map)
  1. 55
  2. 2

We have an entry for each of the 55 attributes in the mapping file.

Duplicate Data

Some of the records in the dataset appear to be duplicate, but that was not the case once they were observed closely. When two or more vehicles are involved in an accident, there is an instance for each vehicle with the location and other features being the same; however, the driver and vehicle information is different.

Missing Values

Missing values in the dataset are not a problem for doing association rules. When the data is transformed into the transaction sparse format, the missing values will provide a good context for the association rules.

Rename the attributes

In [6]:
accident.data.temp <- accident.data
names(accident.data.temp) <- R.column.names.map$New
head(accident.data.temp)
SkipSkip.1Skip.2Skip.3PoliceForceSeveritySkip.4Skip.5Skip.6DistrictAuthorityAgeOfVehicleDriverIMDDecileDriverHomeAreaTypeSkipMonthDateOfMonthDayHourSkip.1Skip.2
527200 178760 -0.169101 51.49343 Metropolitan Police Slight 2 1 2012-01-19T20:35:00 Kensington and Chelsea 5 Less deprived 30-40% Urban area 2012 1 19 4 20 35 0
527200 178760 -0.169101 51.49343 Metropolitan Police Slight 2 1 2012-01-19T20:35:00 Kensington and Chelsea NA More deprived 20-30% Urban area 2012 1 19 4 20 35 0
524930 181430 -0.200838 51.51793 Metropolitan Police Slight 2 1 2012-01-04T17:00:00 Westminster NA Data missing or out of rangeData missing or out of range2012 1 4 3 17 0 0
524930 181430 -0.200838 51.51793 Metropolitan Police Slight 2 1 2012-01-04T17:00:00 Westminster NA More deprived 40-50% Urban area 2012 1 4 3 17 0 0
525860 178080 -0.188636 51.48762 Metropolitan Police Slight 2 1 2012-01-10T10:07:00 Kensington and Chelsea 8 More deprived 20-30% Urban area 2012 1 10 2 10 7 0
525860 178080 -0.188636 51.48762 Metropolitan Police Slight 2 1 2012-01-10T10:07:00 Kensington and Chelsea 2 More deprived 10-20% Urban area 2012 1 10 2 10 7 0

Drop the unwanted columns

The columns that need to be dropped will start with "Skip".

In [7]:
library(data.table)

accident.df = accident.data.temp[,!(names(accident.data.temp) %like% "^Skip")]
names(accident.df)
  1. 'PoliceForce'
  2. 'Severity'
  3. 'DistrictAuthority'
  4. 'HighwayAuthority'
  5. 'RoadClass1'
  6. 'RoadType'
  7. 'SpeedLimit'
  8. 'JunctionDetail'
  9. 'JunctionControl'
  10. 'RoadClass2'
  11. 'PedestrianCrossingControl'
  12. 'PedestrianCrossingFacility'
  13. 'LightCondition'
  14. 'WeatherCondition'
  15. 'RoadSurfaceCondition'
  16. 'SpecialConditionsAtSite'
  17. 'CarriagewayHazards'
  18. 'UrbanorRuralArea'
  19. 'DidPoliceOfficerAttend'
  20. 'VehicleType'
  21. 'Tow'
  22. 'VehicleManoeuvre'
  23. 'VehicleLocation'
  24. 'JunctionLocation'
  25. 'SkiddingAndOverturning'
  26. 'HitObjectInCarriageway'
  27. 'VehicleLeavingCarriageway'
  28. 'HitObjectOffCarriageway'
  29. 'PointOfImpact'
  30. 'LeftHandDrive'
  31. 'JourneyPurpose'
  32. 'Sex'
  33. 'AgeBand'
  34. 'EngineCapacity'
  35. 'PropulsionCode'
  36. 'AgeOfVehicle'
  37. 'DriverIMDDecile'
  38. 'DriverHomeAreaType'
  39. 'Month'
  40. 'DateOfMonth'
  41. 'Day'
  42. 'Hour'

Recode continous variables to categorical variables

The following columns are to be recoded:

  • EngineCapacity
  • AgeOfVehicle
In [8]:
accident.df$EngineCapacity <- cut(accident.df$EngineCapacity, breaks=c(0, 750, 2000, Inf), labels=c("Low","Medium","High"))
In [9]:
accident.df$AgeOfVehicle <- cut(accident.df$AgeOfVehicle, breaks=c(0, 3, 8, Inf), labels=c("New","Medium","Old"))
In [10]:
accident.df$EngineCapacity <- factor(accident.df$EngineCapacity)
accident.df$AgeOfVehicle <- factor(accident.df$AgeOfVehicle)
In [11]:
head(accident.df)
PoliceForceSeverityDistrictAuthorityHighwayAuthorityRoadClass1RoadTypeSpeedLimitJunctionDetailJunctionControlRoadClass2AgeBandEngineCapacityPropulsionCodeAgeOfVehicleDriverIMDDecileDriverHomeAreaTypeMonthDateOfMonthDayHour
Metropolitan Police Slight Kensington and Chelsea Kensington and Chelsea A Single carriageway 30 Crossroads Auto traffic signal C 16 - 20 Medium Petrol Medium Less deprived 30-40% Urban area 1 19 4 20
Metropolitan Police Slight Kensington and Chelsea Kensington and Chelsea A Single carriageway 30 Crossroads Auto traffic signal C 36 - 45 NA NA More deprived 20-30% Urban area 1 19 4 20
Metropolitan Police Slight Westminster Westminster B Single carriageway 30 T or staggered junction Give way or uncontrolled Unclassified -1 NA NA Data missing or out of rangeData missing or out of range1 4 3 17
Metropolitan Police Slight Westminster Westminster B Single carriageway 30 T or staggered junction Give way or uncontrolled Unclassified 16 - 20 NA NA More deprived 40-50% Urban area 1 4 3 17
Metropolitan Police Slight Kensington and Chelsea Kensington and Chelsea A One way street 30 Crossroads Give way or uncontrolled Unclassified 56 - 65 Medium Petrol Medium More deprived 20-30% Urban area 1 10 2 10
Metropolitan Police Slight Kensington and Chelsea Kensington and Chelsea A One way street 30 Crossroads Give way or uncontrolled Unclassified 36 - 45 Low Petrol New More deprived 10-20% Urban area 1 10 2 10
In [12]:
dim(accident.df)
  1. 265877
  2. 42

Association Plots

The following plots display the severity of accidents with different factors, while identifying the Pearson residuals. Residuals demonstrate the difference in expected and observed number of each accident severity, while the Pearson residual is the raw residual divided by the square root of the variance.

In [13]:
if(! "vcd" %in% installed.packages()) install.packages("vcd", depend = TRUE)
library(vcd)
Loading required package: grid
In [14]:
assoc(~ Severity + SpeedLimit, data=accident.df, shade=TRUE, abbreviate_labs=6)

In the above plot, we see multiple values where more or less accidents occurred than expected. For fatal severity, significantly more accidents occurred at a speed limit of 60 than was expected, and even so at 50 and 70. There is a similar trend for speed limit 60 and serious severity. The opposite is true for the slight severity, with less accidents occurring than expected at a speed limit of 60, and more accidents occurring than expected at a speed limit of 30.

The overall trend shows that more severe accidents tend to occur at higher speed limits, and less severe accidents tend to occur at lower speed limits.

In [15]:
assoc(~ Severity + Sex, data=accident.df, shade=TRUE, abbreviate_labs=6)

In the above plot, we see multiple values where more or less accidents occurred than expected. For accidents with a fatal severity, more accidents occur than expected for males and less for females. The same trend is seen for serious accidents. The tables turn for slight accidents, with more accidents occurring than expected for females and less for males. The overall trend shows that males tend to be more involved in serious and fatal accidents, while females are more involved in less severe accidents.

In [16]:
assoc(~ Severity + Month, data=accident.df, shade=TRUE, abbreviate_labs=6)

In the above plot, we see a few values where more or less accidents occurred than expected. The largest residual is seen in serious accidents during the month of May. There are also more fatal accidents than expected in December. There are less serious accidents than expected in the months of January, April, October, and December. The slight accidents do not seem to vary too much throughout the year.

There is no overall trend, however we can assert that the number of fatal accidents in December may be due to the holiday rush and/or inclement weather.

Data Understanding 2

Points: 10

Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.

Evaluation Metrics

Support and confidence are generally the primary metrics for evaluating the quality of the association rules generated by the model, and also what we will use as our metrics. Both support and confidence can be used to rank the generated rules, and can therefore determine how useful the predictions are. More importantly, both metrics are used to determine the validity of rules generated.

The support of a rule indicates how frequently the items in the rule occur together. Support is the ratio of transactions that include all the items in the first and second half of the hypothesis to the number of total transactions.

The confidence of a rule indicates the probability of both the first and second half of the hypothesis appearing in the same transaction. Confidence is the ratio of the rule support to the number of transactions that include the first half of the hypothesis.

Lift is another metric than can be used to evaluate the quality of a rule. It is defined as the confidence of the combination of items divided by the support of the second half of the hypothesis.

⏫ Back to Top

Association Rules Data Preparation

Data Preparation

We have to prepare the data in the Basket sparse format for the apriori algorithm. We followed the steps described below to achieve that:

Data Cleanup

Read the original CSV file, delete the unwanted columns and rename the columns. We performed this step before the data visualization. We used another manually created CSV file that contained the map of the old and the new column names. We had 265877 observations and 42 attributes.

Generate Transactions

We created an intermediate transactions CSV file that had each observation transposed into a = string along with a unique ID for each accident. We achieved this by using another CSV file that contained the attribute names along with the unique list of values for each attribute.

While coming up with the unique list of values, we removed attribute values that indicated missing data such as "N/A" and "Data Missing" and blank values.

We used apply and lapply functions in R to get the list of transactions with the unique accident IDs and we stored them in the transactions.csv file. We had a total of 6,491,814 rows with two columns in the transactions.csv file.

Create the Basket data

We used read.transactions function in arules library to read the transactions.csv into the basket format required by apriori algorithm.

In [17]:
# Excel file containing the unique values for each of the attribute. There will be number of NAs values which will be removed.
unique.value.file  = '../../data/UKAccidents_uniquevalues.csv'
unique.values <- read.csv(unique.value.file)
In [18]:
#Remove NAs from unique.values
unique.values <- na.omit(unique.values)

head(unique.values)
PoliceForceSeverityDistrictAuthorityHighwayAuthorityRoadClass1RoadTypeSpeedLimitJunctionDetailJunctionControlRoadClass2AgeBandEngineCapacityPropulsionCodeAgeOfVehicleDriverIMDDecileDriverHomeAreaTypeMonthDateOfMonthDayHour
Metropolitan Police Slight Kensington and Chelsea Kensington and Chelsea A Single carriageway 30 Crossroads Auto traffic signal A 16 - 20 Low Petrol New Less deprived 30-40% Urban area 1 1 1 0
City of London Serious Westminster Westminster B One way street 50 T or staggered junction Give way or uncontrolled B 36 - 45 Medium Heavy oil Medium More deprived 20-30% Small town 2 2 2 1
Cumbria Fatal Hammersmith and Fulham Hammersmith and Fulham C Roundabout 40 Roundabout Authorised person C 56 - 65 High Hybrid electric Old More deprived 40-50% Rural 3 3 3 2
Lancashire Brent Brent Motorway Dual carriageway 20 Not at junction or within 20 metresStop sign Motorway 46 - 55 Electric More deprived 10-20% 4 4 4 3
Merseyside City of London City of London A(M) Slip road 70 Private drive or entrance A(M) 66 - 75 Gas/Bi-fuel More deprived 30-40% 5 5 5 4
Greater Manchester Islington Islington 60 Slip road 26 - 35 Gas Less deprived 20-30% 6 6 6 5
In [19]:
# Function to transpose a single row of accident data into a column of factors in the format <factor name> = <value>. This
# function will be called from apply function call which will iterate through all the rows in the dataset.
getRowValues <- function(p.onerow, p.unique.value) {
  i <- 1
  f <- character()
  a <- numeric()
  factors <- names(p.onerow)
  for(factor in factors) {
    #print(p.onerow[[factor]])
    if(length(p.onerow[[factor]]) > 0 & p.onerow[[factor]] %in% p.unique.value[, factor]) {
      f[i] <- paste(factor, '=', p.onerow[factor])
      i <- i + 1
    }
  }
  return(f)
}

accident.factors <- apply(accident.df, 1, getRowValues, unique.values)

length(accident.factors)
265877
In [20]:
head(accident.factors)
    1. 'PoliceForce = Metropolitan Police'
    2. 'Severity = Slight'
    3. 'DistrictAuthority = Kensington and Chelsea'
    4. 'HighwayAuthority = Kensington and Chelsea'
    5. 'RoadClass1 = A'
    6. 'RoadType = Single carriageway'
    7. 'SpeedLimit = 30'
    8. 'JunctionDetail = Crossroads'
    9. 'JunctionControl = Auto traffic signal'
    10. 'RoadClass2 = C'
    11. 'PedestrianCrossingControl = None within 50 metres'
    12. 'PedestrianCrossingFacility = Pedestrian phase at traffic signal junction'
    13. 'LightCondition = Darkness - lights lit'
    14. 'WeatherCondition = Fine no high winds'
    15. 'RoadSurfaceCondition = Dry'
    16. 'UrbanorRuralArea = Urban'
    17. 'DidPoliceOfficerAttend = Yes'
    18. 'VehicleType = Car'
    19. 'VehicleManoeuvre = Turning right'
    20. 'VehicleLocation = On main c\'way - not in restricted lane'
    21. 'JunctionLocation = Mid Junction - on roundabout or on main road'
    22. 'PointOfImpact = Nearside'
    23. 'LeftHandDrive = No'
    24. 'Sex = Female'
    25. 'AgeBand = 16 - 20'
    26. 'EngineCapacity = Medium'
    27. 'PropulsionCode = Petrol'
    28. 'AgeOfVehicle = Medium'
    29. 'DriverIMDDecile = Less deprived 30-40%'
    30. 'DriverHomeAreaType = Urban area'
    1. 'PoliceForce = Metropolitan Police'
    2. 'Severity = Slight'
    3. 'DistrictAuthority = Kensington and Chelsea'
    4. 'HighwayAuthority = Kensington and Chelsea'
    5. 'RoadClass1 = A'
    6. 'RoadType = Single carriageway'
    7. 'SpeedLimit = 30'
    8. 'JunctionDetail = Crossroads'
    9. 'JunctionControl = Auto traffic signal'
    10. 'RoadClass2 = C'
    11. 'PedestrianCrossingControl = None within 50 metres'
    12. 'PedestrianCrossingFacility = Pedestrian phase at traffic signal junction'
    13. 'LightCondition = Darkness - lights lit'
    14. 'WeatherCondition = Fine no high winds'
    15. 'RoadSurfaceCondition = Dry'
    16. 'UrbanorRuralArea = Urban'
    17. 'DidPoliceOfficerAttend = Yes'
    18. 'VehicleType = Pedal cycle'
    19. 'VehicleManoeuvre = Going ahead other'
    20. 'VehicleLocation = On main c\'way - not in restricted lane'
    21. 'JunctionLocation = Mid Junction - on roundabout or on main road'
    22. 'PointOfImpact = Front'
    23. 'LeftHandDrive = No'
    24. 'Sex = Male'
    25. 'AgeBand = 36 - 45'
    26. 'DriverIMDDecile = More deprived 20-30%'
    27. 'DriverHomeAreaType = Urban area'
    1. 'PoliceForce = Metropolitan Police'
    2. 'Severity = Slight'
    3. 'DistrictAuthority = Westminster'
    4. 'HighwayAuthority = Westminster'
    5. 'RoadClass1 = B'
    6. 'RoadType = Single carriageway'
    7. 'SpeedLimit = 30'
    8. 'JunctionDetail = T or staggered junction'
    9. 'JunctionControl = Give way or uncontrolled'
    10. 'PedestrianCrossingControl = None within 50 metres'
    11. 'PedestrianCrossingFacility = No physical crossing facilities within 50 metres'
    12. 'LightCondition = Darkness - lights lit'
    13. 'WeatherCondition = Fine no high winds'
    14. 'RoadSurfaceCondition = Dry'
    15. 'UrbanorRuralArea = Urban'
    16. 'DidPoliceOfficerAttend = Yes'
    17. 'VehicleType = Car'
    18. 'VehicleManoeuvre = Going ahead other'
    19. 'VehicleLocation = On main c\'way - not in restricted lane'
    20. 'JunctionLocation = Mid Junction - on roundabout or on main road'
    21. 'LeftHandDrive = No'
    1. 'PoliceForce = Metropolitan Police'
    2. 'Severity = Slight'
    3. 'DistrictAuthority = Westminster'
    4. 'HighwayAuthority = Westminster'
    5. 'RoadClass1 = B'
    6. 'RoadType = Single carriageway'
    7. 'SpeedLimit = 30'
    8. 'JunctionDetail = T or staggered junction'
    9. 'JunctionControl = Give way or uncontrolled'
    10. 'PedestrianCrossingControl = None within 50 metres'
    11. 'PedestrianCrossingFacility = No physical crossing facilities within 50 metres'
    12. 'LightCondition = Darkness - lights lit'
    13. 'WeatherCondition = Fine no high winds'
    14. 'RoadSurfaceCondition = Dry'
    15. 'UrbanorRuralArea = Urban'
    16. 'DidPoliceOfficerAttend = Yes'
    17. 'VehicleType = Pedal cycle'
    18. 'VehicleManoeuvre = Going ahead other'
    19. 'VehicleLocation = On main c\'way - not in restricted lane'
    20. 'JunctionLocation = Mid Junction - on roundabout or on main road'
    21. 'LeftHandDrive = No'
    22. 'Sex = Male'
    23. 'AgeBand = 16 - 20'
    24. 'DriverIMDDecile = More deprived 40-50%'
    25. 'DriverHomeAreaType = Urban area'
    1. 'PoliceForce = Metropolitan Police'
    2. 'Severity = Slight'
    3. 'DistrictAuthority = Kensington and Chelsea'
    4. 'HighwayAuthority = Kensington and Chelsea'
    5. 'RoadClass1 = A'
    6. 'RoadType = One way street'
    7. 'SpeedLimit = 30'
    8. 'JunctionDetail = Crossroads'
    9. 'JunctionControl = Give way or uncontrolled'
    10. 'PedestrianCrossingControl = None within 50 metres'
    11. 'PedestrianCrossingFacility = Pelican, puffin, toucan or similar non-junction pedestrian light crossing'
    12. 'LightCondition = Daylight'
    13. 'WeatherCondition = Fine no high winds'
    14. 'RoadSurfaceCondition = Dry'
    15. 'UrbanorRuralArea = Urban'
    16. 'DidPoliceOfficerAttend = Yes'
    17. 'VehicleType = Car'
    18. 'VehicleManoeuvre = Going ahead other'
    19. 'VehicleLocation = On main c\'way - not in restricted lane'
    20. 'JunctionLocation = Mid Junction - on roundabout or on main road'
    21. 'PointOfImpact = Front'
    22. 'LeftHandDrive = No'
    23. 'Sex = Female'
    24. 'AgeBand = 56 - 65'
    25. 'EngineCapacity = Medium'
    26. 'PropulsionCode = Petrol'
    27. 'AgeOfVehicle = Medium'
    28. 'DriverIMDDecile = More deprived 20-30%'
    29. 'DriverHomeAreaType = Urban area'
    1. 'PoliceForce = Metropolitan Police'
    2. 'Severity = Slight'
    3. 'DistrictAuthority = Kensington and Chelsea'
    4. 'HighwayAuthority = Kensington and Chelsea'
    5. 'RoadClass1 = A'
    6. 'RoadType = One way street'
    7. 'SpeedLimit = 30'
    8. 'JunctionDetail = Crossroads'
    9. 'JunctionControl = Give way or uncontrolled'
    10. 'PedestrianCrossingControl = None within 50 metres'
    11. 'PedestrianCrossingFacility = Pelican, puffin, toucan or similar non-junction pedestrian light crossing'
    12. 'LightCondition = Daylight'
    13. 'WeatherCondition = Fine no high winds'
    14. 'RoadSurfaceCondition = Dry'
    15. 'UrbanorRuralArea = Urban'
    16. 'DidPoliceOfficerAttend = Yes'
    17. 'VehicleType = Motorcycle 50cc and under'
    18. 'VehicleManoeuvre = Going ahead other'
    19. 'VehicleLocation = On main c\'way - not in restricted lane'
    20. 'JunctionLocation = Mid Junction - on roundabout or on main road'
    21. 'PointOfImpact = Front'
    22. 'LeftHandDrive = No'
    23. 'Sex = Male'
    24. 'AgeBand = 36 - 45'
    25. 'EngineCapacity = Low'
    26. 'PropulsionCode = Petrol'
    27. 'AgeOfVehicle = New'
    28. 'DriverIMDDecile = More deprived 10-20%'
    29. 'DriverHomeAreaType = Urban area'
In [21]:
library(dplyr)
options(warn=-1)

accident.transactions <- accident.factors %>% lapply(as.data.frame) %>% bind_rows(.id = "ID") %>% mutate(ID = as.numeric(ID))
Attaching package: 'dplyr'

The following objects are masked from 'package:data.table':

    between, first, last

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

In [22]:
dim(accident.transactions)
names(accident.transactions) <- c("AccidentID", "Factor")

accident.transactions %>% group_by('AccidentID') %>% summarize(accident.count = n_distinct(AccidentID))
accident.transactions %>% group_by('Factor') %>% summarize(factor.count = n_distinct(Factor))
  1. 6491814
  2. 2
"AccidentID"accident.count
AccidentID265877
"Factor"factor.count
Factor199
In [23]:
head(accident.transactions, 20)
tail(accident.transactions, 20)
AccidentIDFactor
1 PoliceForce = Metropolitan Police
1 Severity = Slight
1 DistrictAuthority = Kensington and Chelsea
1 HighwayAuthority = Kensington and Chelsea
1 RoadClass1 = A
1 RoadType = Single carriageway
1 SpeedLimit = 30
1 JunctionDetail = Crossroads
1 JunctionControl = Auto traffic signal
1 RoadClass2 = C
1 PedestrianCrossingControl = None within 50 metres
1 PedestrianCrossingFacility = Pedestrian phase at traffic signal junction
1 LightCondition = Darkness - lights lit
1 WeatherCondition = Fine no high winds
1 RoadSurfaceCondition = Dry
1 UrbanorRuralArea = Urban
1 DidPoliceOfficerAttend = Yes
1 VehicleType = Car
1 VehicleManoeuvre = Turning right
1 VehicleLocation = On main c'way - not in restricted lane
AccidentIDFactor
6491795265877 PedestrianCrossingControl = None within 50 metres
6491796265877 PedestrianCrossingFacility = Pelican, puffin, toucan or similar non-junction pedestrian light crossing
6491797265877 LightCondition = Darkness - lights lit
6491798265877 WeatherCondition = Fine no high winds
6491799265877 RoadSurfaceCondition = Wet or damp
6491800265877 UrbanorRuralArea = Rural
6491801265877 DidPoliceOfficerAttend = No
6491802265877 VehicleType = Bus or coach (17 or more pass seats)
6491803265877 VehicleManoeuvre = Turning right
6491804265877 VehicleLocation = On main c'way - not in restricted lane
6491805265877 JunctionLocation = Not at or within 20 metres of junction
6491806265877 PointOfImpact = Nearside
6491807265877 LeftHandDrive = No
6491808265877 JourneyPurpose = Journey as part of work
6491809265877 Sex = Male
6491810265877 AgeBand = 66 - 75
6491811265877 EngineCapacity = High
6491812265877 PropulsionCode = Heavy oil
6491813265877 AgeOfVehicle = Old
6491814265877 DriverHomeAreaType = Urban area
In [41]:
# write transactions to CSV file
write.csv(accident.transactions, "../../data/accident_transactions.csv", row.names = FALSE, quote = FALSE)
In [24]:
# Read the transactions CSV file
library(arules)
library(arulesViz)
library(ggplot2)
Loading required package: Matrix

Attaching package: 'arules'

The following object is masked from 'package:dplyr':

    recode

The following objects are masked from 'package:base':

    abbreviate, write

In [25]:
transactions.obj <- read.transactions(file = "../../data/accident_transactions.csv", format = "single", 
                                      sep = ",",
                                      cols = c("AccidentID", "Factor"), 
                                      rm.duplicates = FALSE,
                                      quote = "", skip = 0,
                                      encoding = "unknown")
In [26]:
transactions.obj
transactions in sparse format with
 265877 transactions (rows) and
 199 items (columns)

In the sparse basket format, there are 265,877 rows and 199 items. The rows correspond to the original number of observations and the columns are factors with their levels.

Modeling and Evaluation 1

Points: 10

Train and adjust parameters

Apriori Algorithm

Association rules analysis does not involve the standard forms of training a dataset that has been performed in the previous labs. However, it does involve specific formatting as detailed above and modeling with the Apriori algorithm. In this section, the Apriori algorithm is implemented and rules are displayed.

Association rule mining can be divided into a two-step approach. First is frequent itemset generation, which involves finding all combination of items in a set of transactions that occur with a specified minimum frequency. The second step is rule generation, which involves calculating the rules that express the probable co-occurence of items within frequent itemsets. By using the Apriori algorithm, the probability of an item being present in a frequent itemset is calculated, given that another item or items is present.

In [27]:
as.data.frame(head(sort(itemFrequency(transactions.obj, type = "absolute")
                        , decreasing = TRUE), 10) )  # Most  frequent
head(sort(itemFrequency(transactions.obj, type = "absolute"), decreasing = TRUE), 10)
LeftHandDrive = No264726
PedestrianCrossingControl = None within 50 metres264396
VehicleLocation = On main c'way - not in restricted lane260235
Severity = Slight228011
PedestrianCrossingFacility = No physical crossing facilities within 50 metres219896
DidPoliceOfficerAttend = Yes219239
WeatherCondition = Fine no high winds210543
LightCondition = Daylight197757
RoadType = Single carriageway196818
VehicleType = Car191604
In [28]:
as.data.frame(head(sort(itemFrequency(transactions.obj, type = "absolute")
                        , decreasing = FALSE), 10))  # Least frequent
head(sort(itemFrequency(transactions.obj, type = "absolute"), decreasing = FALSE), 10)
SpeedLimit = 10 1
PropulsionCode = Gas 6
PropulsionCode = Electric 18
SkiddingAndOverturning = Jackknifed and overturned 45
Tow = Double or multiple trailer 82
SkiddingAndOverturning = Jackknifed 90
PropulsionCode = Petrol/Gas (LPG) 94
HitObjectInCarriageway = Road works 97
VehicleLocation = Busway (including guided busway)102
Tow = Caravan139
In [29]:
itemFrequencyPlot(transactions.obj, topN = 25)

The plot above displays the most frequent items in the dataset. The first bar indicates that accidents usually involve right hand drive, which makes sense for the UK. The second bar indicates accidents occur in places where there is no pedestrian crossing control within 50 meters, and the third that the accidents occur on main crossways. A couple more interesting items are: light condition of daylight, sex of male, urban area, speed limit 30, and vehicle maneuver of going ahead. One bar is important to discuss, and that is severity of slight. The data has significantly more instances of slight accidents than serious and fatal, but we attempt to generate rules for all with the values used as inputs/parameters.

Implement Apriori Algorithm

In [30]:
# Interest Measures
support    <- 0.01

# Frequent item sets
parameters = list(
  support = support,
  minlen  = 2,  # Minimal number of items per item set
  maxlen  = 10, # Maximal number of items per item set
  target  = "frequent itemsets"
)

freq.items <- apriori(transactions.obj, parameter = parameters)
# Let us examine our freq item sites
freq.items.df <- data.frame(item_set = labels(freq.items), support = freq.items@quality)
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
         NA    0.1    1 none FALSE            TRUE       5    0.01      2
 maxlen            target   ext
     10 frequent itemsets FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 2658 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[199 item(s), 265877 transaction(s)] done [0.46s].
sorting and recoding items ... [120 item(s)] done [0.12s].
creating transaction tree ... done [0.28s].
checking subsets of size 1 2 3 4 done [12.09s].
writing ... [260173 set(s)] done [0.04s].
creating S4 object  ... done [0.17s].
In [31]:
head(freq.items.df, 5)
tail(freq.items.df, 5)
item_setsupport.supportsupport.count
{JourneyPurpose = Taking pupil to/from school,VehicleLocation = On main c'way - not in restricted lane}0.01011370 2689
{JourneyPurpose = Taking pupil to/from school,PedestrianCrossingControl = None within 50 metres} 0.01009865 2685
{JourneyPurpose = Taking pupil to/from school,LeftHandDrive = No} 0.01034689 2751
{DidPoliceOfficerAttend = Yes,Severity = Fatal} 0.01062521 2825
{Severity = Fatal,VehicleLocation = On main c'way - not in restricted lane} 0.01042211 2771
item_setsupport.supportsupport.count
260169{DidPoliceOfficerAttend = Yes,PedestrianCrossingControl = None within 50 metres,Severity = Slight,VehicleLocation = On main c'way - not in restricted lane} 0.6782986 180344
260170{DidPoliceOfficerAttend = Yes,LeftHandDrive = No,Severity = Slight,VehicleLocation = On main c'way - not in restricted lane} 0.6786559 180439
260171{DidPoliceOfficerAttend = Yes,LeftHandDrive = No,PedestrianCrossingControl = None within 50 metres,Severity = Slight} 0.6873855 182760
260172{DidPoliceOfficerAttend = Yes,LeftHandDrive = No,PedestrianCrossingControl = None within 50 metres,VehicleLocation = On main c'way - not in restricted lane}0.8010772 212988
260173{LeftHandDrive = No,PedestrianCrossingControl = None within 50 metres,Severity = Slight,VehicleLocation = On main c'way - not in restricted lane} 0.8317380 221140

View Rules

In [32]:
# Let us now examine the rules
confidence <- 0.2 # Interest Measure
parameters = list(
  support = support,
  confidence = confidence,
  minlen  = 4,  # Minimal number of items per item set
  maxlen  = 10, # Maximal number of items per item set
  target  = "rules"
)

rules <- apriori(transactions.obj, parameter = parameters)
rules.df <- data.frame(rules = labels(rules)
                       ,rules@quality)

head(rules.df)
tail(rules.df)
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.2    0.1    1 none FALSE            TRUE       5    0.01      4
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 2658 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[199 item(s), 265877 transaction(s)] done [0.48s].
sorting and recoding items ... [120 item(s)] done [0.12s].
creating transaction tree ... done [0.32s].
checking subsets of size 1 2 3 4 done [8.62s].
writing ... [696492 rule(s)] done [0.10s].
creating S4 object  ... done [0.51s].
rulessupportconfidenceliftcount
{DidPoliceOfficerAttend = Yes,Severity = Fatal,VehicleLocation = On main c'way - not in restricted lane} => {PedestrianCrossingControl = None within 50 metres}0.01029047 0.9963583 1.0019394 2736
{DidPoliceOfficerAttend = Yes,PedestrianCrossingControl = None within 50 metres,Severity = Fatal} => {VehicleLocation = On main c'way - not in restricted lane}0.01029047 0.9722814 0.9933609 2736
{PedestrianCrossingControl = None within 50 metres,Severity = Fatal,VehicleLocation = On main c'way - not in restricted lane} => {DidPoliceOfficerAttend = Yes}0.01029047 0.9909453 1.2017459 2736
{DidPoliceOfficerAttend = Yes,Severity = Fatal,VehicleLocation = On main c'way - not in restricted lane} => {LeftHandDrive = No} 0.01024158 0.9916242 0.9959357 2723
{DidPoliceOfficerAttend = Yes,LeftHandDrive = No,Severity = Fatal} => {VehicleLocation = On main c'way - not in restricted lane} 0.01024158 0.9731951 0.9942944 2723
{LeftHandDrive = No,Severity = Fatal,VehicleLocation = On main c'way - not in restricted lane} => {DidPoliceOfficerAttend = Yes} 0.01024158 0.9909025 1.2016939 2723
rulessupportconfidenceliftcount
696487{DidPoliceOfficerAttend = Yes,LeftHandDrive = No,PedestrianCrossingControl = None within 50 metres} => {VehicleLocation = On main c'way - not in restricted lane}0.8010772 0.9811679 1.0024400 212988
696488{LeftHandDrive = No,PedestrianCrossingControl = None within 50 metres,VehicleLocation = On main c'way - not in restricted lane} => {DidPoliceOfficerAttend = Yes}0.8010772 0.8265280 1.0023526 212988
696489{PedestrianCrossingControl = None within 50 metres,Severity = Slight,VehicleLocation = On main c'way - not in restricted lane} => {LeftHandDrive = No} 0.8317380 0.9955566 0.9998852 221140
696490{LeftHandDrive = No,Severity = Slight,VehicleLocation = On main c'way - not in restricted lane} => {PedestrianCrossingControl = None within 50 metres} 0.8317380 0.9944597 1.0000301 221140
696491{LeftHandDrive = No,PedestrianCrossingControl = None within 50 metres,Severity = Slight} => {VehicleLocation = On main c'way - not in restricted lane} 0.8317380 0.9797050 1.0009454 221140
696492{LeftHandDrive = No,PedestrianCrossingControl = None within 50 metres,VehicleLocation = On main c'way - not in restricted lane} => {Severity = Slight} 0.8317380 0.8581629 1.0006788 221140
In [33]:
rules
set of 696492 rules 

The rules generated above utilized the following parameters: support value of 0.01 and confidence value of 0.2. These rules are more general, because the next section focuses on rules for each severity level. The lift value denotes the performance of the model at predicting or classifying measured against random chance. It can more simply be broken down to the ratio of the target response divided by the average response. A lift values less than 1 denotes a negative association, while a value greater than one denotes a positive association and likely more useful rule.

In this case, from the head we can see one rule that sticks out with a lift greater than 1. That rule is: {Severity = Fatal} => {DidPoliceOfficerAttend = Yes}. This would make sense, that a police officer would attend the scene of a fatal car accident. No rule truly stands out in the tail, as a lift value of 1 generally implies the probability of occurrence of the antecedent and consequent are independent.

In [34]:
rules.sorted <- sort(rules, by='lift')
inspect(rules.sorted[1:10], n=10)
     lhs                                                           rhs                       support confidence     lift count
[1]  {PropulsionCode = Petrol,                                                                                                
      UrbanorRuralArea = Urban,                                                                                               
      VehicleType = Motorcycle 125cc and under}                 => {EngineCapacity = Low} 0.01837316  0.9920796 18.60557  4885
[2]  {DriverHomeAreaType = Urban area,                                                                                        
      PropulsionCode = Petrol,                                                                                                
      VehicleType = Motorcycle 125cc and under}                 => {EngineCapacity = Low} 0.01838068  0.9916802 18.59808  4887
[3]  {PointOfImpact = Front,                                                                                                  
      PropulsionCode = Petrol,                                                                                                
      VehicleType = Motorcycle 125cc and under}                 => {EngineCapacity = Low} 0.01349496  0.9914341 18.59346  3588
[4]  {PropulsionCode = Petrol,                                                                                                
      SpeedLimit = 30,                                                                                                        
      VehicleType = Motorcycle 125cc and under}                 => {EngineCapacity = Low} 0.01850856  0.9913376 18.59165  4921
[5]  {JunctionDetail = T or staggered junction,                                                                               
      PropulsionCode = Petrol,                                                                                                
      VehicleType = Motorcycle 125cc and under}                 => {EngineCapacity = Low} 0.01005728  0.9911045 18.58728  2674
[6]  {PropulsionCode = Petrol,                                                                                                
      RoadType = Single carriageway,                                                                                          
      VehicleType = Motorcycle 125cc and under}                 => {EngineCapacity = Low} 0.01953159  0.9910305 18.58589  5193
[7]  {DidPoliceOfficerAttend = Yes,                                                                                           
      PropulsionCode = Petrol,                                                                                                
      VehicleType = Motorcycle 125cc and under}                 => {EngineCapacity = Low} 0.02098339  0.9909414 18.58422  5579
[8]  {JunctionControl = Give way or uncontrolled,                                                                             
      PropulsionCode = Petrol,                                                                                                
      VehicleType = Motorcycle 125cc and under}                 => {EngineCapacity = Low} 0.01479255  0.9909297 18.58400  3933
[9]  {PropulsionCode = Petrol,                                                                                                
      VehicleLocation = On main c'way - not in restricted lane,                                                               
      VehicleType = Motorcycle 125cc and under}                 => {EngineCapacity = Low} 0.02331529  0.9900974 18.56839  6199
[10] {LeftHandDrive = No,                                                                                                     
      PropulsionCode = Petrol,                                                                                                
      VehicleType = Motorcycle 125cc and under}                 => {EngineCapacity = Low} 0.02359362  0.9900568 18.56763  6273

Sample Plot

A simple plot of the first 1000 rules are shown in following plot. More detailed plots will be discussed in the following sections.

In [35]:
plot(rules.sorted[1:1000], method = NULL, measure = "support", shading = "lift", engine='plotly')
To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Data Preparation

Modeling and Evaluation 2

Points: 10

Evaluate and Compare.

Compare the rules for Accident Severities

In this section we wanted to compare the association rules for the various accident severities (Fatal, Serious and Slight). The dataset is unbalanced in the number of observations for each of the accident severities as shown in the bar plot below. Therefore, we had to tune the confidence and the lift parameters to the apriori algorithm individually for each of the severity to get the relevant rules.

In [36]:
# counts 
g <- ggplot(accident.df, aes(Severity))
g + geom_bar()

Association Rules for Serious Accidents

In [37]:
serious.rules <- apriori(transactions.obj, parameter = list(minlen=3, supp=0.00047, conf=0.3), 
                         appearance = list(rhs=c('Severity = Serious'), default='lhs'), control = list(verbose=F))
serious.rules
serious.rules.sorted <- sort(serious.rules, by='lift')
inspect(serious.rules.sorted[1:10], n=10)
set of 243 rules 
     lhs                                                    rhs                       support confidence     lift count
[1]  {EngineCapacity = Low,                                                                                            
      LightCondition = Daylight,                                                                                       
      VehicleLeavingCarriageway = Offside}               => {Severity = Serious} 0.0005641707  0.4838710 3.674039   150
[2]  {EngineCapacity = Low,                                                                                            
      VehicleLeavingCarriageway = Offside,                                                                             
      WeatherCondition = Fine no high winds}             => {Severity = Serious} 0.0006318711  0.4813754 3.655090   168
[3]  {EngineCapacity = Low,                                                                                            
      RoadSurfaceCondition = Dry,                                                                                      
      VehicleLeavingCarriageway = Offside}               => {Severity = Serious} 0.0005566484  0.4789644 3.636784   148
[4]  {EngineCapacity = Low,                                                                                            
      RoadType = Single carriageway,                                                                                   
      VehicleLeavingCarriageway = Offside}               => {Severity = Serious} 0.0006393934  0.4722222 3.585590   170
[5]  {DidPoliceOfficerAttend = Yes,                                                                                    
      EngineCapacity = Low,                                                                                            
      VehicleLeavingCarriageway = Offside}               => {Severity = Serious} 0.0006882882  0.4704370 3.572035   183
[6]  {EngineCapacity = Low,                                                                                            
      Sex = Male,                                                                                                      
      VehicleLeavingCarriageway = Offside}               => {Severity = Serious} 0.0006657214  0.4694960 3.564890   177
[7]  {EngineCapacity = Low,                                                                                            
      VehicleLeavingCarriageway = Offside}               => {Severity = Serious} 0.0007033327  0.4617284 3.505910   187
[8]  {EngineCapacity = Low,                                                                                            
      PropulsionCode = Petrol,                                                                                         
      VehicleLeavingCarriageway = Offside}               => {Severity = Serious} 0.0007033327  0.4617284 3.505910   187
[9]  {EngineCapacity = Low,                                                                                            
      LeftHandDrive = No,                                                                                              
      VehicleLeavingCarriageway = Offside}               => {Severity = Serious} 0.0007033327  0.4617284 3.505910   187
[10] {EngineCapacity = Low,                                                                                            
      PedestrianCrossingControl = None within 50 metres,                                                               
      VehicleLeavingCarriageway = Offside}               => {Severity = Serious} 0.0006995716  0.4615385 3.504468   186

The above are rules generated using the Apriori algorithm for accidents with a serious severity. Seeing as the rules are sorted by lift, we are seeing rules with a lift significantly greater than 1, indicating positive associations and more useful rules. Most of the visible rules invole a low engine capacity, vehicle leaving the carriageway offside, and some other factor contributing to a severity of serious. It is also important to note the count on the right side, showing how often this rule appears in the dataset, with all the visible rules appearing over 150 times. These rules were generated with a support of 0.00047 and confidence of 0.3.

Association Rules for Fatal Accidents

In [40]:
fatal.rules <- apriori(transactions.obj, parameter = list(minlen=3, supp=0.00001, conf=0.4), 
                       appearance = list(rhs=c('Severity = Fatal'), default='lhs'), control = list(verbose=F))
fatal.rules
fatal.rules.sorted <- sort(fatal.rules, by='lift')
inspect(fatal.rules.sorted[1:10], n=10)
set of 94 rules 
     lhs                                                                                 rhs                     support confidence     lift count
[1]  {VehicleLocation = On lay-by or hard shoulder,                                                                                               
      WeatherCondition = Fog or mist}                                                 => {Severity = Fatal} 1.128341e-05          1 93.29018     3
[2]  {PoliceForce = Greater Manchester,                                                                                                           
      SpecialConditionsAtSite = Road sign or marking defective or obscured,                                                                       
      SpeedLimit = 20}                                                                => {Severity = Fatal} 1.128341e-05          1 93.29018     3
[3]  {CarriagewayHazards = Previous accident,                                                                                                     
      PoliceForce = Cumbria,                                                                                                                      
      WeatherCondition = Fog or mist}                                                 => {Severity = Fatal} 1.128341e-05          1 93.29018     3
[4]  {CarriagewayHazards = Previous accident,                                                                                                     
      LightCondition = Darkness - no lighting,                                                                                                    
      WeatherCondition = Fog or mist}                                                 => {Severity = Fatal} 1.128341e-05          1 93.29018     3
[5]  {CarriagewayHazards = Previous accident,                                                                                                     
      JunctionDetail = More than 4 arms (not roundabout),                                                                                         
      RoadClass2 = C}                                                                 => {Severity = Fatal} 1.128341e-05          1 93.29018     3
[6]  {CarriagewayHazards = Previous accident,                                                                                                     
      JunctionDetail = More than 4 arms (not roundabout),                                                                                         
      SpeedLimit = 70}                                                                => {Severity = Fatal} 1.128341e-05          1 93.29018     3
[7]  {CarriagewayHazards = Previous accident,                                                                                                     
      JunctionDetail = More than 4 arms (not roundabout),                                                                                         
      JunctionLocation = Approaching junction or waiting/parked at junction approach} => {Severity = Fatal} 1.128341e-05          1 93.29018     3
[8]  {CarriagewayHazards = Previous accident,                                                                                                     
      JunctionDetail = More than 4 arms (not roundabout),                                                                                         
      UrbanorRuralArea = Rural}                                                       => {Severity = Fatal} 1.128341e-05          1 93.29018     3
[9]  {CarriagewayHazards = Previous accident,                                                                                                     
      JunctionDetail = More than 4 arms (not roundabout),                                                                                         
      LightCondition = Daylight}                                                      => {Severity = Fatal} 1.128341e-05          1 93.29018     3
[10] {CarriagewayHazards = Previous accident,                                                                                                     
      RoadClass2 = C,                                                                                                                             
      SpeedLimit = 70}                                                                => {Severity = Fatal} 1.128341e-05          1 93.29018     3

The above are rules generated using the Apriori algorithm for accidents with a fatal severity. Seeing as the rules are sorted by lift, we are seeing rules with the largest lift values. Most of the visible rules involve carriage way hazards of a previous accident, junction detail of more than 4 arms, some have a speed limit of 70, and fog or mist weather conditions. Based on these factors, the rules do seem likely to involve fatal accidents. One would assume that bad weather would play a factor in driivng conditions, and driving fast would result in a more severe incident. It is good to note that the hazards show previous accidents had occured at these locations. These rules were generated with a low support of 0.00001 and confidence of 0.4.

Association Rules for Slight Accidents

In [43]:
slight.rules <- apriori(transactions.obj, parameter = list(minlen=3, supp=0.6, conf=0.8), 
                        appearance = list(rhs=c('Severity = Slight'), default='lhs'), control = list(verbose=F))
slight.rules
slight.rules.sorted <- sort(slight.rules, by='lift')
inspect(slight.rules.sorted[1:10], n=10)
set of 46 rules 
     lhs                                                           rhs                   support confidence     lift  count
[1]  {VehicleLocation = On main c'way - not in restricted lane,                                                            
      VehicleType = Car}                                        => {Severity = Slight} 0.6243789  0.8782981 1.024158 166008
[2]  {LeftHandDrive = No,                                                                                                  
      VehicleLocation = On main c'way - not in restricted lane,                                                            
      VehicleType = Car}                                        => {Severity = Slight} 0.6228594  0.8782842 1.024142 165604
[3]  {PedestrianCrossingControl = None within 50 metres,                                                                   
      VehicleLocation = On main c'way - not in restricted lane,                                                            
      VehicleType = Car}                                        => {Severity = Slight} 0.6208623  0.8782441 1.024095 165073
[4]  {LeftHandDrive = No,                                                                                                  
      PedestrianCrossingControl = None within 50 metres,                                                                   
      VehicleLocation = On main c'way - not in restricted lane,                                                            
      VehicleType = Car}                                        => {Severity = Slight} 0.6193503  0.8782266 1.024075 164671
[5]  {LeftHandDrive = No,                                                                                                  
      VehicleType = Car}                                        => {Severity = Slight} 0.6310286  0.8777788 1.023552 167776
[6]  {PedestrianCrossingControl = None within 50 metres,                                                                   
      VehicleType = Car}                                        => {Severity = Slight} 0.6289863  0.8777437 1.023511 167233
[7]  {LeftHandDrive = No,                                                                                                  
      PedestrianCrossingControl = None within 50 metres,                                                                   
      VehicleType = Car}                                        => {Severity = Slight} 0.6274405  0.8777195 1.023483 166822
[8]  {LightCondition = Daylight,                                                                                           
      VehicleLocation = On main c'way - not in restricted lane} => {Severity = Slight} 0.6300432  0.8662158 1.010069 167514
[9]  {LightCondition = Daylight,                                                                                           
      PedestrianCrossingControl = None within 50 metres,                                                                   
      VehicleLocation = On main c'way - not in restricted lane} => {Severity = Slight} 0.6263009  0.8661947 1.010044 166519
[10] {LeftHandDrive = No,                                                                                                  
      LightCondition = Daylight,                                                                                           
      VehicleLocation = On main c'way - not in restricted lane} => {Severity = Slight} 0.6271321  0.8660873 1.009919 166740

The above are rules generated using the Apriori algorithm for accidents with a slight severity. Seeing as the rules are sorted by lift, we are seeing rules with a lift close to 1, which may or may not indicate associations usefulness. Most of the visible rules invole a location on a main road, right hand drive, daylight conditions, and some other factor contributing to a severity of slight. It is also important to note the count on the right side, showing how often this rule appears in the dataset, with all the visible rules appearing over 166,000 times. These rules were generated with a support of 0.6 and confidence of 0.8. If the support value is too low, there will be numerous rules, since there were more slight accidents in the dataset.

There were 46 rules generated for the accidents with a slight severity. The rules are plotted above demonstrating the evaluation metrics of support and lift. The darker colors demonstrate a higher lift, and those rules tend to have lower support.

Modeling and Evaluation 3

Points: 10

Visualize Results.

Visualization of Rules for Serious Accidents

In [70]:
# Scatter plot of rules
plot(serious.rules.sorted, method = "scatterplot", measure = c("support","lift"), shading = "order", jitter=0, engine="plotly")

Data Preparation

In [62]:
# Interactive scatter plots
plot(serious.rules.sorted, method = NULL, measure = "support", shading = "lift", engine='pl')
To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Data Preparation

In [78]:
plot(head(sort(serious.rules.sorted, by="lift"), 200), method = "grouped")
In [81]:
plot(serious.rules[1:15], method = "paracoord")

The visualization of the rules for serious accidents show that the light conditions, speed limit and deprivation play a major role.

Visualization of Rules for Fatal Accidents

In [83]:
# Scatter plot of rules
plot(fatal.rules.sorted, method = "scatterplot", measure = c("support","lift"), shading = "order", jitter=0, engine="plotly")

Data Preparation

In [84]:
# Interactive scatter plots
plot(fatal.rules.sorted, method = NULL, measure = "support", shading = "lift", engine='pl')
To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Data Preparation

In [87]:
plot(head(sort(fatal.rules.sorted, by="lift"), 100), method = "grouped")
In [89]:
plot(fatal.rules[1:10], method = "paracoord")

The visualization of the rules for fatal accidents show that the light conditions, speed limit and deprivation play a major role. They also show that the males are more prone to be involved in fatal accidents than females.

Visualization of Rules for Slight Accidents

In [90]:
# Scatter plot of rules
plot(slight.rules.sorted, method = "scatterplot", measure = c("support","lift"), shading = "order", jitter=0, engine="plotly")

Data Preparation

In [91]:
# Interactive scatter plots
plot(slight.rules.sorted, method = NULL, measure = "support", shading = "lift", engine='pl')

Data Preparation

In [92]:
plot(head(sort(slight.rules.sorted, by="lift"), 100), method = "grouped")
In [93]:
plot(slight.rules[1:10], method = "paracoord")

The visualization of the rules for slight accidents show that all the factors can come into play in the cause of the accidents.

Modeling and Evaluation 4

Points: 20

Summarize the Ramifications.

Ramifications of Rules for Serious Accidents

In [94]:
plot(head(sort(serious.rules.sorted, by="lift"), 25), method="graph", control=list(cex=.7))

Based on earlier analysis, we saw the trend that a lower value of support and larger value of confidence resulted in the highest lift values. This graph demonstrates those rules with a deeper opaque color. These rules maintain the analysis, as the smaller circles represent lower support values. Some useful rules involve fine and daylight environmental conditions, vehicle leaving the carriageway offside, vehicle location on a main roadway, and a low engine capacity.

Ramifications of Rules for Fatal Accidents

In [65]:
plot(head(sort(fatal.rules.sorted, by="lift"), 25), method="graph", control=list(cex=.7))

This plot is extremely unique in that all the rules shown seem to share the same support and lift values. However, I think this is due to the effect of having very small parameter values. The support value used in rule generation was 0.00001 for fatal accidents, which seems to have been rounded to 0 for the purposes of the graph. In this case, we will defer to the results seen above in the rule generation. Due to there being such few data points for fatal accidents, a small support value was used to generate rules.

Ramifications of Rules for Slight Accidents

In [66]:
plot(head(sort(slight.rules.sorted, by="lift"), 25), method="graph", control=list(cex=.7))

This plot also supports the above assertion that a lower support value and larger confidence value results in higher lift values, demonstrated by the small opaque circles. Some of the recurring factors in the generated rules for slight accidents are: vehicle type car, right hand drive, no pedestrian crossing within 50 meters, and vehicle location on a main roadway. By just looking at these rules, some seem to be a bit implied, such as the vehicle being a car and having right hand drive in the UK. This is in part due to the large number of slight accidents in the dataset, which is the majority. For the slight analysis, a higher support value of 0.6 is used to reduce the number of rules generated.

Deployment

Points: 10

Be critical of your performance and tell the reader how you current model might be usable by other parties. Did you achieve your goals? If not, can you reign in the utility of your modeling?

  • How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?
  • How would your deploy your model for interested parties?
  • What other data should be collected?
  • How often would the model need to be updated, etc.?

Model Usefulness

The goal of this analysis was to establish the association rules that can be derived from the UK Accidents data from 2012. Specifically, we wanted to examine the factors in the rules that are related to the severity of the numerous accidents. In this respect, our goal was acheived in generating the varios rules associated with the slight, serious, and fatal severities with the associated statistics to determine usefulness.

The rules generated are absolutely usable by various stakeholders, especially those rules that are deemed more useful based on the evaluation metrics. Normal drivers that are concerned with safety of routes can use the information to decide what types of junctions or intersections to avoid on commutes. First responders and health professionals can get an idea of what factors lead to different severities of accidents, and better prepare in those areas that could be considered high risk based on the association rules. Planners and developers can utilize the information from this analysis to better understand the specific factors surrounding different types of accidents and work to avoid them as best possible. The association rules not only provide information into the severity of accidents, but the factors surrounding them; there are countless parties that could benefit from and utilize the findings.

Model Deployment

This model would be best deployed as an online resource or in a consulting capacity. As an accessible resource, this would allow a variety of parties to utilize the information to meet their needs. For industries or companies looking to fully understand accident severity in the UK, they could hire us on as consultants to analyze in more depth and create visualizations based on specific scenarios. The assocaition rules could also be altered to meet the need, moving beyond severity, and diving into the subsets of different factors.

Data Collection

To further improve the models, more data is always helpful in drawing conclusions on usefulness for the generated association rules. More data regarding serious and fatal accidents would allow for the same sense of confidence as in the slight accident results. With additional data collected on the type of accident, such as fender bender or head on collision, that could provide for more granularity in analysis of the assocation rules.

Model Updation

For reliable results, the model should be updated every quarter or 3 months to ensure the most recent data is factored into the associaton rules. This up to date information provides more assurance in the results and conclusions drawn.

Exceptional Work

Points: 10

You have free reign to provide additional analyses or combine analyses.

Graphical visualization of accidents using google maps

The observations for the accidents have the latitude and logitude coordinates associated with the place where the accident occured. We used google maps to plot the accidents for the entire United Kingdom and also some specific cities to see how the accidents are spread out in the various areas.

In order to use google maps in our code, we had to follow the following steps:

  • Create a google cloud account
  • Create a new project
  • Enable billing for the project
  • Enable the Map APIs for the project
  • Create key credentials
  • Restrict the key credentials for the use with the map APIs only
  • Copy the key and invoke register_google function before invoking the map APIs
In [49]:
options(warn=-1)
library(ggmap)
library(ggplot2)
Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
Please cite ggmap if you use it! See citation("ggmap") for details.
In [50]:
register_google(key="AIzaSyBF0uwfKOPUzLJ8l1DKaaSi5fX6mY65xlA")
In [51]:
UKMap <- get_map(location = c(lon = -3, lat = 54), color = "color", source = "google", maptype = "terrain", 
                 zoom = 6)
UKggMap <- ggmap(UKMap, extent = "device", ylab = "Latitude", xlab = "Longitude")
UKggMap + geom_point(aes(x = Longitude, y = Latitude, colour= Accident.Severity, fill = Accident.Severity), 
                             data = accident.data)
Source : https://maps.googleapis.com/maps/api/staticmap?center=54,-3&zoom=6&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx

The entire map shows the slight accidents spread out throughout the entire country, the serious in some concentrated places and the fatal ones near the cities.

In [52]:
LondonMap <- get_map(location = c(lon = -0.16179, lat = 51.538525), color = "color", source = "google", maptype = "terrain", 
                     zoom = 12)
LondonggMap <- ggmap(LondonMap, extent = "device", ylab = "Latitude", xlab = "Longitude")
LondonggMap + geom_point(aes(x = Longitude, y = Latitude, colour= Accident.Severity, fill = Accident.Severity), 
                             data = accident.data)
Source : https://maps.googleapis.com/maps/api/staticmap?center=51.538525,-0.16179&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx

The map of London shows the slight accidents in most of the main roads. They seem to occur closer to the city center or downtown.

In [56]:
ManchesterMap <- get_map(location = c(lon = -2.244644, lat = 53.4808), color = "color", source = "google", maptype = "terrain", 
                         zoom = 12)
ManchesterggMap <- ggmap(ManchesterMap, extent = "device", ylab = "Latitude", xlab = "Longitude")
ManchesterggMap + geom_point(aes(x = Longitude, y = Latitude, colour= Accident.Severity, fill = Accident.Severity), 
                             data = accident.data)
Source : https://maps.googleapis.com/maps/api/staticmap?center=53.4808,-2.244644&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx

Manchester map shows accidents concentrated on the city center.

In [95]:
BirminghamMap <- get_map(location = c(lon = -1.898575, lat = 52.48947), color = "color", source = "google", maptype = "terrain", 
                         zoom = 12)
BirminghamggMap <- ggmap(BirminghamMap, extent = "device", ylab = "Latitude", xlab = "Longitude")
BirminghamggMap + geom_point(aes(x = Longitude, y = Latitude, colour= Accident.Severity, fill = Accident.Severity), 
                             data = accident.data)
Source : https://maps.googleapis.com/maps/api/staticmap?center=52.48947,-1.898575&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx

Birmingham shows the accidents are evenly spread out on all parts of the city.

References

D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

MSDS 7331 - Lab Three: Clustering, Association Rules, or Recommenders, Investigators: Matt Baldree, Ben Brock, Tom Elkins, Austin Kelly